Abstract

In this paper, we construct linear codes over $\mathbb{Z}_4$ with bounded
GC-content. The codes are obtained using a greedy algorithm over
$\mathbb{Z}_4$. Further, upper and lower bounds are derived for the maximum
size of DNA codes of length $n$ with constant GC-content $w$ and edit
distance $d$.
1 Introduction

Deoxyribonucleic acid (DNA) contains the genetic program for the
biological development of life. DNA is formed by strands linked together
and twisted in the shape of a double helix. Each strand is a sequence
of four possible nucleotides: two purines, adenine A and guanine G, and
two pyrimidines, thymine T and cytosine C. The ends of a DNA strand
are chemically polar with 5' and 3' ends, which implies that the strands
are oriented. Hybridization, known as base pairing, occurs when a strand
binds to another strand, forming a double strand of DNA. The strands
are linked following the Watson-Crick model: every A is linked with a
T, and every G with a C, and vice versa. We denote the complement of
$X$ by $\overline{X}$, i.e., $\overline{A} = T$, $\overline{T} = A$,
$\overline{G} = C$ and $\overline{C} = G$. The pairing is done in
the opposite direction and the reverse order. For instance, the
Watson-Crick complementary (WCC) strand of 3'-AGTTAGA-5' is the strand
5'-TCAATCT-3'.

The WCC property of DNA strands is used in DNA computing. In
this case the data is encoded using DNA strands, and molecular biology
techniques are used to simulate arithmetic and logical operations.

The main advantages of this approach are huge memory capacity, massive
parallelism, and low power molecular hardware and software. Other
applications make use of the properties of DNA [?].

In this paper, we construct linear codes over $\mathbb{Z}_4$ with bounded
GC-content. The codes are obtained using a greedy algorithm over
$\mathbb{Z}_4$. Further, upper and lower bounds on the maximum size of
DNA codes of length $n$ with constant GC-content $w$ and edit distance $d$
are given. The choice of the ring $\mathbb{Z}_4$ comes from the fact that
the bounded GC-content and bounded edit distance properties are
multiplicative over $\mathbb{Z}_4$. This is not the case over
$\mathbb{F}_4$. The bounded GC-content constraint ensures that all
codewords have thermodynamic characteristics below some threshold. This
is an important criterion for DNA sequences as it reduces the probability
of erroneous cross-hybridization.

In [?], Chee and Ling gave an algorithm to construct DNA codes with
large GC-content which are optimal only up to $n = 12$. Bishop et al. [?]
considered the construction of random codes with fixed GC-content using
a probabilistic model. King and Condon et al. [?] gave several upper
and lower bounds on the maximum size of DNA codes of length $n$ with
constant GC-content $w$ and Hamming distance $d$. It is well known that
the Hamming distance does not capture the thermodynamic and combinatorial
properties of DNA strands. In fact, the edit distance is a much
more appropriate metric for designing codes for DNA computing. Thus,
in the second part of this paper upper and lower bounds are derived for
the maximum size of DNA codes of length $n$ with constant GC-content $w$
and edit distance $d$.

The remainder of this paper is organized as follows. In Section 2, some
preliminary results are presented. Section 3 employs a greedy algorithm
to obtain DNA codes with bounded GC-content, and in Section 4 DNA
lexicodes are constructed with bounded edit distance. Upper and lower
bounds on the edit distance are also presented. In addition, examples of
DNA codes with bounded GC-content and edit distance are given.


2 Preliminaries 

The ring $\mathbb{Z}_4$ with elements $\{0, 1, 2, 3\}$ is considered here
with addition and multiplication modulo 4. It is a finite chain ring with
maximal ideal $\langle 2 \rangle$ and nilpotency index 2. The Hamming
weight of a codeword $x$ in $\mathbb{Z}_4^n$ is defined as
$w_H(x) = n_1(x) + n_2(x) + n_3(x)$, where $n_i(x)$ is the number of
coordinates of $x$ equal to $i$, and the Hamming distance $d_H(x, y)$
between two codewords $x$ and $y$ as $w_H(x - y)$. We define the reverse
of $x = (x_0 x_1 \cdots x_{n-1})$ to be
$x^r = (x_{n-1} x_{n-2} \cdots x_1 x_0)$.

The elements $\{0, 1, 2, 3\}$ of $\mathbb{Z}_4$ are in one-to-one
correspondence with the nucleotide DNA bases $\{A, T, C, G\}$ by the map
$\phi$ such that $0 \to G$, $2 \to C$, $3 \to T$ and $1 \to A$.

The complement of the codeword $x = (x_0 x_1 \cdots x_{n-1})$ is the vector
$x^c = (\overline{x_0}\, \overline{x_1} \cdots \overline{x_{n-1}})$. The
reverse complement (also called the Watson-Crick complement) is
$x^{rc} = (\overline{x_{n-1}}\, \overline{x_{n-2}} \cdots \overline{x_1}\, \overline{x_0})$.
For $x \in \mathbb{Z}_4$, $\overline{x}$ is defined to be
$\overline{\phi(x)}$. A linear code $C$ is said to satisfy the reverse
constraint, respectively the reverse-complement constraint, if for all
$x \in C$ we have $x^r \in C$, respectively $x^{rc} \in C$.
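
The correspondence and the three strand operations can be sketched in a few
lines of Python. This is only an illustration under the assumption that the
map is $0 \to G$, $1 \to A$, $2 \to C$, $3 \to T$; the names `PHI`, `phi`,
`reverse`, `complement` and `reverse_complement` are hypothetical helpers
introduced here. Under this map the Watson-Crick pairing (A with T, G with C)
corresponds to adding 2 modulo 4.

```python
# Assumed correspondence between Z4 and DNA bases: 0->G, 1->A, 2->C, 3->T.
PHI = {0: "G", 1: "A", 2: "C", 3: "T"}

def phi(x):
    """Map a vector over Z4 (a sequence of ints in 0..3) to a DNA string."""
    return "".join(PHI[s % 4] for s in x)

def reverse(x):
    """x^r: the coordinates of x in reverse order."""
    return tuple(reversed(x))

def complement(x):
    """x^c: A<->T and G<->C, i.e. s -> s + 2 (mod 4) under PHI."""
    return tuple((s + 2) % 4 for s in x)

def reverse_complement(x):
    """x^rc: the reverse of x with every coordinate complemented."""
    return complement(reverse(x))

x = (1, 0, 3, 3, 1, 2, 1)
print(phi(x))                       # AGTTACA
print(phi(reverse(x)))              # ACATTGA
print(phi(complement(x)))           # TCAATGT
print(phi(reverse_complement(x)))   # TGTAACT
```

A code given as a set of such tuples satisfies the reverse-complement
constraint exactly when `reverse_complement(x)` is in the set for every `x`.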

2.1 Construction of Lexicodes over $\mathbb{Z}_4$

The construction of lexicodes over $\mathbb{Z}_4$ given in [4] is now
reviewed. A linear code $C$ of length $n$ over $\mathbb{Z}_4$ is an
additive code over $\mathbb{Z}_4$, that is, a subgroup of $\mathbb{Z}_4^n$
under addition. Thus $\mathbb{Z}_4^n$ is a linear code over $\mathbb{Z}_4$
with basis $B = \{b_1, \ldots, b_n\}$. With respect to this basis, we
recursively define a lexicographically ordered list
$V_i = x_1, x_2, \ldots, x_{4^i}$ as follows:

$$V_0 := 0,$$

$$V_i := V_{i-1},\; b_i + V_{i-1},\; 2b_i + V_{i-1},\; 3b_i + V_{i-1}, \quad 1 \le i \le n.$$

In this way $|V_i| = 4^i$, and $\mathbb{Z}_4^n$ can be associated with
$V_n$. Assume now that we have a property $P$ which tests whether a vector
$c \in \mathbb{Z}_4^n$ is selected or not. The selection property $P$ on
$V$ can be seen as a boolean-valued function

$$P : V \to \{\text{True}, \text{False}\},$$

that depends on one variable. Over $\mathbb{Z}_4$, the property $P$ is
called a multiplicative property if $P[x]$ true implies $P[3x]$ true. The
following greedy algorithm provides lexicodes over $\mathbb{Z}_4$ [3].

Algorithm 1

1. $C_0 := 0$; $i := 1$;

2. select the first vector $a_i \in V_i \setminus V_{i-1}$ such that
$P[2a_i + c]$ holds for all $c \in C_{i-1}$;

3. if such an $a_i$ exists, then
$C_i := C_{i-1},\; a_i + C_{i-1},\; 2a_i + C_{i-1},\; 3a_i + C_{i-1}$;
otherwise $C_i := C_{i-1}$;

4. $i := i + 1$; return to 2.

For $0 \le i \le n$, the code $C_i$ is forced to be linear because all
linear combinations of the selected vectors $a_{i_1}, \ldots, a_{i_l}$,
$l \le i$, are taken. The code $C_i$ has a 'basis' formed from
$a_{i_1}, \ldots, a_{i_l}$, so we have a nested sequence of linear codes

$$0 = C_0 \subseteq C_1 \subseteq \cdots \subseteq C_n.$$

$C_n$ is the lexicode and is denoted $C_n = C(B, P)$, where $B$ is the
ordering and $P$ is the selection property. We have the following result.

Theorem 1. ([?, Theorem 4]) For any basis $B$ of $R^n$ (here
$R = \mathbb{Z}_4$) and any multiplicative selection criterion $P$, the
lexicode $C(B, P)$ is linear and $P[x]$ holds for each codeword $x \ne 0$.
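
The ordered list $V_i$ and the greedy selection can be sketched as follows.
This is a minimal illustration, not the implementation of [3] or [4]: the
function names (`ordered_list`, `lexicode`, `gc_content`) are hypothetical,
the map $0 \to G$, $2 \to C$ is assumed, and, as a conservative assumption,
the sketch tests $P$ on every new vector $k a + c$ with $k = 1, 2, 3$ (the
case $k = 3$ is redundant when $P$ is multiplicative, since
$3a + c = 3(a + 3c)$). It also assumes $P$ holds for the zero vector, which
is the case for the GC-content property used in the example.

```python
def add(u, v):
    """Coordinate-wise sum in Z4^n."""
    return tuple((a + b) % 4 for a, b in zip(u, v))

def scale(k, u):
    """Scalar multiple k*u in Z4^n."""
    return tuple((k * a) % 4 for a in u)

def ordered_list(basis, i):
    """V_i = V_{i-1}, b_i + V_{i-1}, 2b_i + V_{i-1}, 3b_i + V_{i-1}, V_0 = (0)."""
    V = [tuple([0] * len(basis[0]))]
    for b in basis[:i]:
        V = [add(scale(k, b), v) for k in range(4) for v in V]
    return V

def lexicode(basis, P):
    """Greedy selection over V_1, ..., V_n; returns (generators, codewords)."""
    code = [tuple([0] * len(basis[0]))]       # C_0 = {0}
    generators = []
    seen = set(code)                          # the vectors of V_{i-1}
    for i in range(1, len(basis) + 1):
        V_i = ordered_list(basis, i)
        for a in V_i:
            if a in seen:
                continue                      # candidates lie in V_i \ V_{i-1}
            # Assumption: P is tested on every new vector k*a + c, k = 1, 2, 3.
            if all(P(add(scale(k, a), c)) for k in (1, 2, 3) for c in code):
                generators.append(a)
                # C_i = C_{i-1}, a + C_{i-1}, 2a + C_{i-1}, 3a + C_{i-1},
                # with repeated vectors dropped (they occur when 2a = 0).
                code = list(dict.fromkeys(
                    add(scale(k, a), c) for k in range(4) for c in code))
                break
        seen = set(V_i)
    return generators, code

def gc_content(x):
    """Number of coordinates equal to 0 or 2 (G or C under the assumed map)."""
    return sum(1 for s in x if s % 2 == 0)

n, w = 4, 4
basis = [tuple(1 if j == i else 0 for j in range(n)) for i in range(n)]
gens, code = lexicode(basis, lambda x: gc_content(x) >= w)
print(gens)        # [(2, 0, 0, 0), (0, 2, 0, 0), (0, 0, 2, 0), (0, 0, 0, 2)]
print(len(code))   # 16
```

The search is exhaustive over $V_i$, so the sketch is only practical for
small $n$; it is meant to make the order of selection explicit rather than
to be efficient.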

3 A Greedy Algorithm for Bounded GC-content DNA Codes

In this Section we construct DNA codes with bounded GC-content using 
Algorithm 1. We begin with the following definition. 

Definition 2. Let $C$ be a linear code in $\mathbb{Z}_4^n$. The GC-content
of a codeword $x \in C$, denoted by $GC(\phi(x))$, is the number of
occurrences of G and C in $\phi(x)$:

$$GC(\phi(x)) = |\{1 \le i \le n : \phi(x)_i \in \{G, C\}\}| = w_{GC}(\phi(x)).$$

We say that a subset $C$ of $\mathbb{Z}_4^n$ satisfies the bounded
GC-content constraint if there exists a positive integer $w$ such that
$GC(\phi(x)) \ge w$ for all $x \in C$.
Remark 3. Definition 2 differs from the conventional definition [?].
The bounded GC-content constraint ensures that all codewords have a
hybridization energy below some threshold, which results in stable DNA
strands.

Proposition 4. The property $P_1[x]$, which is true if and only if
$w_{GC}(\phi(x)) \ge w$, is a multiplicative property over $\mathbb{Z}_4$.

Proof 5. Let $x \in \mathbb{Z}_4^n$ be such that $w_{GC}(\phi(x)) \ge w$.
Multiplying the vector $x$ by 3 does not change the number of 0's and 2's.
This gives $w_{GC}(\phi(3x)) = w_{GC}(\phi(x)) \ge w$, and the result
follows.
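
For small lengths the multiplicative property can also be checked by brute
force. The sketch below is an illustration only; `is_multiplicative` and
`gc_content` are hypothetical helper names, and the map $0 \to G$, $2 \to C$
is assumed, so that the GC-content of $x$ is the number of entries in
$\{0, 2\}$.

```python
from itertools import product

def gc_content(x):
    """Number of coordinates of x equal to 0 or 2."""
    return sum(1 for s in x if s in (0, 2))

def is_multiplicative(P, n):
    """Exhaustively check that P[x] true implies P[3x] true on Z4^n."""
    for x in product(range(4), repeat=n):
        x3 = tuple((3 * s) % 4 for s in x)
        if P(x) and not P(x3):
            return False
    return True

n, w = 5, 3
print(is_multiplicative(lambda x: gc_content(x) >= w, n))   # True
# A property that is not multiplicative: "the first coordinate equals 1".
print(is_multiplicative(lambda x: x[0] == 1, n))            # False
```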

3.1 Construction Results 

In this section, construction results are presented for linear codes over
$\mathbb{Z}_4$ with bounded GC-content. In this case, the verification
step for $w_{GC}(\phi(2x)) \ge w$ in Algorithm 1 can be eliminated. This is
because for $x \in \mathbb{Z}_4^n$, $w_{GC}(\phi(x)) \ge w$ implies that
$w_{GC}(\phi(2x)) \ge w$ (every entry of $2x$ lies in $\{0, 2\}$, so
$\phi(2x)$ consists only of G's and C's), and this improves the speed of
the algorithm. Some of these codes attain the upper bound (5) given in
Proposition 11. Furthermore, the codes obtained are linear as opposed to
those in [?]. Table 1 gives DNA lexicodes over $\mathbb{Z}_4^n$ obtained
using the selection property $P_1[x]$ ($w_{GC}(\phi(x)) \ge w$). The DNA
code strands corresponding to the first and second codes in Table 1 are
given in Tables 2 and 3, respectively.
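
As a sanity check, the code spanned by a basis from Table 1 can be
enumerated directly. The sketch below is an illustration under the assumed
map $0 \to G$, $1 \to A$, $2 \to C$, $3 \to T$; it uses the three basis
vectors listed in the second row of Table 1 ($n = 10$, $w = 6$), and the
helper names `span`, `gc_content` and `hamming_weight` are hypothetical.

```python
from itertools import product

PHI = {0: "G", 1: "A", 2: "C", 3: "T"}   # assumed map

def span(generators):
    """All Z4-linear combinations of the generators, as a set of tuples."""
    n = len(generators[0])
    return {
        tuple(sum(k * g[j] for k, g in zip(coeffs, generators)) % 4
              for j in range(n))
        for coeffs in product(range(4), repeat=len(generators))
    }

def gc_content(x):
    return sum(1 for s in x if s in (0, 2))

def hamming_weight(x):
    return sum(1 for s in x if s != 0)

# Basis of C(B, P) from the second row of Table 1 (n = 10, w = 6).
generators = [tuple(int(ch) for ch in row)
              for row in ("2111100000", "1321010000", "3231001000")]
code = span(generators)
strands = {"".join(PHI[s] for s in x) for x in code}

print(len(code))                                        # 64
print(min(gc_content(x) for x in code))                 # 6
print(min(hamming_weight(x) for x in code if any(x)))   # 4
# Every entry of 2x is 0 or 2, so the GC-content of 2x equals n.
print(all(gc_content(tuple(2 * s % 4 for s in x)) == 10 for x in code))  # True
```

The three basis vectors themselves map to the strands CAAAAGGGGG,
ATCAGAGGGG and TCTAGGAGGG.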


4 DNA Codes and Edit Distance 

The edit distance has been used for biological computation, in particular
for two types of genetic mutation. The first is the substitution of
nucleotides and consists of two possible mutations:

• Transition: a purine is replaced by a purine (A ↔ G) or a pyrimidine
is replaced by a pyrimidine (T ↔ C).

• Transversion: a purine is replaced by a pyrimidine or the reverse
(e.g. A ↔ C).

The second is modification using insertions and deletions.
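
The two kinds of substitution can be told apart mechanically. The following
small sketch is an illustration only; the function name `substitution_type`
is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_type(a, b):
    """Classify the substitution of base a by base b."""
    if a == b:
        raise ValueError("not a substitution: both bases are equal")
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

print(substitution_type("A", "G"))   # transition
print(substitution_type("T", "C"))   # transition
print(substitution_type("A", "C"))   # transversion
```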

In this section, we consider the edit distance in the greedy algorithm in
order to find large sets of DNA codewords of length $n$ with given
$w_{GC}$ and minimum edit distance $d$. We begin by providing a definition
of edit distance which follows the presentation in [7].

Let $A$ and $B$ be finite sets of distinct symbols and let $x^t \in A^t$
denote an arbitrary string of length $t$ over $A$. The string edit distance
is characterized by a triple $\langle A, B, c \rangle$ consisting of the
finite sets $A$ and $B$, and the primitive cost function
$c : E \to \mathbb{R}_+$, where $\mathbb{R}_+$ is the set of nonnegative
reals, $E = E_s \cup E_d \cup E_i$ is the set of primitive edit operations,
$E_s = A \times B$ is the set of substitutions,
$E_d = A \times \{\epsilon\}$ is the set of deletions, and
$E_i = \{\epsilon\} \times B$ is the set of insertions. Each triple
$\langle A, B, c \rangle$ induces a distance function
$d_c : A^* \times B^* \to \mathbb{R}_+$ that maps a pair of strings to a
nonnegative value [7].
Table 1: DNA Lexicodes over $\mathbb{Z}_4^n$ Obtained using the Selection
Property $P_1[x]$ ($w_{GC}(\phi(x)) \ge w$)

n     w     d_H   Basis of Z_4^n     Basis of C(B, P)

8     4     4     Canonical basis    21111000
                                     13210100
                                     32310010

10    6     4     Canonical basis    2111100000
                                     1321010000
                                     3231001000

10    10    1     Canonical basis    2000000000
                                     0200000000
                                     0020000000
                                     0002000000
                                     0000200000
                                     0000020000
                                     0000002000
                                     0000000200
                                     0000000020
                                     0000000002

12    12    1     Canonical basis    200000000000
                                     020000000000
                                     002000000000
                                     000200000000
                                     000020000000
                                     000002000000
                                     000000200000
                                     000000020000
                                     000000002000
                                     000000000200
                                     000000000020
                                     000000000002
Table 2: DNA Code Strands Corresponding to the Linear Code in the First
Row of Table 1

GGGGGGGG    GGGGGGGG    GGGGGGGG    GAAAAGGG
GGGGGGGG    GGGGGGGG    GGGGGGGG    GATTTGGG
GGGCCCGG    GGGCCCGG    GGGAAAAG    AAGGGGTT
GAAAAGGG    AAAAGGGG    GGGAAACT    TTAAGGCG
TGGGAAAG    CTGGGAAA    CTAAAGGG    CGGAAAGT
GGAAACTG    AGTGGGAA    GAAACTGG    TGGGCTTT
AAACTGGG    AAGTGGGA    GGGAAGTT    GGGGATTT
TTGGGAAC    ACTTGGGA    TGGGAAGT    TTGGGGAA
CTTGGGAA    AACTTGGG    GGGGTTAA    AATTGGGG
GGAAGTTG    GAAGTTGG    AGGGCTTA    TAAAGGGG
AAGGGCTT    TTAAGGGC    GTTAAGGG    TTGGGCTT
GGTTAAGG    TAAGGGCT    GGGTTAAG    GGGCTTTT
ATTGGGGA    TTGGGCAA    GGGACTTT    GACCCTTT
TTTGGGAG    TTGGGACT    TGGGACTT    CAATTCCG
GTTTGGGA    TTTACGGG    GACTTTGG    GAACCCTT
CATTTGGG    GGAGTTTG    GGGCTTTT    GAAACCCT


Table 3: DNA Code Strands Corresponding to the Linear Code in the Second
Row of Table 1

GGGGGGGGGG    TCTAGGAGGG    GGCCGGCGGG    ACATGGTGGG
ATCAGAGGGG    GAACGAAGGG    TTGTGACGGG    CATGGATGGG
CCGCGCGGGG    AGTTGCAGGG    GCCGGCCGGG    TGAAGCTGGG
TACTGTGGGG    CTAGGTAGGG    AAGCGTCGGG    GTTCGTTGGG
CAAAAGGGGG    ATGCAGAGGG    GATTAGCGGG    TTCGAGTGGG
TGTGAAGGGG    CCCAAAAGGG    AGACAACGGG    GCGTAATGGG
GTATACGGGG    TAGGACAGGG    CTTAACCGGG    AACCACTGGG
ACTGATGGGG    GGCAATAGGG    TCACATCGGG    CGGTATTGGG
GCCCCGGGGG    TGATCGAGGG    CCGGCGCGGG    AGTACGTGGG
AAGTCAGGGG    GTTGCAAGGG    TACACACGGG    CTACCATGGG
CGCGCCGGGG    ACAACCAGGG    CGGCCCCGGG    TCTTCCTGGG
TTGACTGGGG    CATCCTAGGG    ATCTCTCGGG    GAAGCTTGGG
CTTTTGGGGG    AACGTGAGGG    GTAATGCGGG    TAGCGGTGGG
TCAGTAGGGG    CGGATAAGGG    ACTCTACGGG    GGCTTATGGG
GATATCGGGG    TTCCTCAGGG    CAATTCCGGG    ATGGTCTGGG
AGACTTGGGG    GCGTTTAGGG    TGTGTTCGGG    CCCATTTGGG


Table 4: DNA Lexicodes over $\mathbb{Z}_4^n$ Obtained using the Selection
Property $P_2[x]$ ($d_c(\phi(x), \phi(y)) \le m$)

n    phi(x)   m    w_GC   Basis of Z_4^n     Basis of C(B, P)

4    GGGG     1    4      Canonical basis    2222
                                             2202
                                             2220
                                             2022

4    GGGC     2    4      Canonical basis    2020
                                             0022
                                             0220
                                             2222


Definition 6. The edit distance $d_c(x^t, y^v)$ between two strings
$x^t \in A^t$ and $y^v \in B^v$ is defined recursively as

$$d_c(x^t, y^v) = \min \begin{cases}
c(x_t, y_v) + d_c(x^{t-1}, y^{v-1}), \\
c(x_t, \epsilon) + d_c(x^{t-1}, y^v), \\
c(\epsilon, y_v) + d_c(x^t, y^{v-1}),
\end{cases}$$

where $d_c(\epsilon, \epsilon) = 0$ and $\epsilon$ denotes the empty string
(of length 0).
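
The recursion translates directly into a memoized function. The sketch
below is an illustration rather than the implementation of [7]: the name
`edit_distance` is hypothetical, the empty string `""` stands for
$\epsilon$, and when no cost function is supplied it assumes unit costs
(cost 0 for matching symbols, 1 for any substitution, insertion or
deletion), which gives the usual Levenshtein distance.

```python
from functools import lru_cache

def edit_distance(x, y, c=None):
    """d_c(x, y) computed from the three-way recursion of the definition."""
    if c is None:
        # Assumption: unit costs, with zero cost for substituting a symbol
        # by itself; "" plays the role of the empty string.
        c = lambda a, b: 0 if a == b else 1

    @lru_cache(maxsize=None)
    def d(t, v):
        # d(t, v) is the distance between the prefixes x[:t] and y[:v].
        if t == 0 and v == 0:
            return 0
        options = []
        if t > 0 and v > 0:                      # substitution (or match)
            options.append(c(x[t - 1], y[v - 1]) + d(t - 1, v - 1))
        if t > 0:                                # deletion of x[t-1]
            options.append(c(x[t - 1], "") + d(t - 1, v))
        if v > 0:                                # insertion of y[v-1]
            options.append(c("", y[v - 1]) + d(t, v - 1))
        return min(options)

    return d(len(x), len(y))

print(edit_distance("GGGG", "GGGC"))   # 1
print(edit_distance("GGGG", "CCCC"))   # 4
print(edit_distance("ACGT", "CGTA"))   # 2
```

The last example shows the difference from the Hamming distance: ACGT and
CGTA differ in all four positions, yet one deletion and one insertion
transform one into the other.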

The edit distance constraint for a DNA code $C$ is $d_c(x, y) \ge d$ for
all $x, y \in C$, $x \ne y$, for some prescribed minimum edit distance $d$.
The edit distance constraint can reduce non-specific hybridization between
distinct codewords, as well as allow for the correction of insertion,
deletion and substitution errors in codewords.


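Proposition 7. The property $P_2[x]$, which is true if and only if
$d_c(\phi(x), \phi(y)) \le m$, is a multiplicative property over
$\mathbb{Z}_4$.

Proof 8. Let $x \in \mathbb{Z}_4^n$ and $y \in \mathbb{Z}_4^n$. Multiplying
$x$ by 3 and $y$ by 3 does not change the number of 0's and 2's. Therefore
the number of 1's and 3's also does not change, so

$$n_1(x) + n_0(x) + n_2(x) + n_3(x) = n_1(3x) + n_0(3x) + n_2(3x) + n_3(3x).$$

This also holds for $y$. Moreover, multiplication by 3 fixes 0 and 2 and
exchanges 1 and 3 in every coordinate, so it relabels the symbols of $x$
and $y$ in the same way. Provided the cost function $c$ is invariant under
this relabeling (as it is for unit costs), $d_c(x, y) = d_c(3x, 3y)$. □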
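
For short lengths the invariance $d_c(x, y) = d_c(3x, 3y)$ can be confirmed
exhaustively. The sketch below is an illustration only: it assumes unit
costs (Levenshtein distance) and the map $0 \to G$, $1 \to A$, $2 \to C$,
$3 \to T$, and the names `levenshtein`, `phi` and `times3` are hypothetical.

```python
from itertools import product

PHI = "GACT"   # index s -> base, i.e. the assumed map 0->G, 1->A, 2->C, 3->T

def levenshtein(s, t):
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (a != b)))  # substitution or match
        prev = cur
    return prev[-1]

def phi(x):
    return "".join(PHI[s] for s in x)

def times3(x):
    return tuple((3 * s) % 4 for s in x)

n = 3
vectors = list(product(range(4), repeat=n))
invariant = all(
    levenshtein(phi(x), phi(y)) == levenshtein(phi(times3(x)), phi(times3(y)))
    for x in vectors for y in vectors
)
print(invariant)   # True
```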

Now we use Algorithm 1 to construct linear codes over $\mathbb{Z}_4$ with
GC-content bounded by $w$ and edit distance $d_c(\phi(x), \phi(y))$ such
that $x \in \mathbb{Z}_4^n$ and $y \in \mathbb{Z}_4^n$. The results are
given in Table 4.

4.1 Upper and Lower Bounds 

Let $A_4(n, d)$ be the maximum size of a code over $\mathbb{Z}_4$ with
length $n$ and minimum edit distance $d$. Let $A_4^{GC}(n, d, w)$ be the
maximum size of a DNA code with length $n$, minimum edit distance $d$, and
fixed GC weight $w$. Further, let $A_4^{GC,R}(n, d, w)$, respectively
$A_4^{GC,RC}(n, d, w)$, be the maximum size of a DNA code with length $n$,
minimum edit distance $d$, and fixed GC weight $w$, that satisfies the
reverse constraint, respectively the reverse-complement constraint. The
purpose of this section is to give upper and lower bounds on these
quantities. We have the following theorem.

Theorem 9. For $n > 0$ with $0 \le d \le n$ and $0 \le w \le n$, the
following results hold.

$$A_4^{GC}(n, d, 0) = A_2(n, d), \qquad (1)$$

$$A_4^{GC}(n, d, w) = A_4^{GC}(n, d, n - w), \qquad (2)$$

and if $w = n/2$ then

$$A_4^{GC}(n, d, w) = 4. \qquad (3)$$

Proof 10. The analogous result for DNA codes with GC-content and
Hamming distance was given in [?]. The corresponding proof is employed
here for the edit distance.

(1): Let $C$ be a linear code over $\mathbb{Z}_4^n$ with
$w_{GC}(\phi(C)) = 0$. Then $\phi(C)$ contains only A's and T's (that is,
$C$ contains only 1's and 3's), so $C$ can be considered as a binary code,
which gives $A_4^{GC}(n, d, 0) = A_2(n, d)$.

(2): Since $w_{GC}(\phi(C)) = n - w_{AT}(\phi(C))$, interchanging the A's
with C's and T's with G's gives $w_{GC}(\phi(C)) = n - w$, so that
$A_4^{GC}(n, d, w) = A_4^{GC}(n, d, n - w)$.

(3) : Since {n,d,w) < d,w), by IKA Theorem 5] we have 

that 

{n,d,w) = 2. Then $4 \le A_4^{GC}(n, d, w)$, and by the pigeonhole
principle $A_4^{GC}(n, d, w) \le 4$, so that $A_4^{GC}(n, d, w) = 4$.

We have the following relationship between the GC-content of a code
and the code size over the alphabet $\{A, T, C, G\}$.

Proposition 11.

$$A_4^{GC}(n, d, w) \ge A_4^{GC}(n + 1, d + 1, w). \qquad (4)$$

$$A_4^{GC}(n, d, w) \ge A_4^{GC}(n + 1, d, w)/4. \qquad (5)$$

Proof 12. The analogous result for DNA codes with unrestricted GC-content
and Hamming distance was given in [?]. The corresponding proof is employed
here for the edit distance.

(4): An $(n, A_4^{GC}(n+1, d+1, w), d, w)$ code can be obtained from an
$(n+1, A_4^{GC}(n+1, d+1, w), d+1, w)$ code by removing a symbol from each
codeword such that their GC-content is preserved.

(5): If all the codewords in an $(n+1, A_4^{GC}(n+1, d, w), d, w)$ code are
partitioned into four subsets according to the first symbol, one of the
subsets will have size at least $A_4^{GC}(n+1, d, w)/4$ and thus is an
$(n+1, A_4^{GC}(n+1, d, w)/4, d, w)$ code. By removing the (common) symbol
from all codewords in the largest subset, an
$(n, A_4^{GC}(n+1, d, w)/4, d, w)$ code is obtained.
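
The partition step used for (5) can be illustrated on a toy example. The
sketch below is not from the paper: the code `toy_code` and the helper names
are hypothetical, unit costs (Levenshtein distance) are assumed, and only
the size and the minimum edit distance are tracked; GC-weight bookkeeping
is not modeled.

```python
from collections import defaultdict
from itertools import combinations

def levenshtein(s, t):
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]

def min_distance(code):
    """Minimum pairwise edit distance of a code with at least two words."""
    return min(levenshtein(u, v) for u, v in combinations(code, 2))

def largest_class_shortened(code):
    """Group codewords by first symbol; return the largest class and the
    same class with the common first symbol removed."""
    classes = defaultdict(list)
    for word in code:
        classes[word[0]].append(word)
    largest = max(classes.values(), key=len)
    return largest, [word[1:] for word in largest]

toy_code = ["GAACT", "GTTCA", "GCGAT", "CAGTA", "CTCAG", "ATGCC", "TGACG"]
largest, short = largest_class_shortened(toy_code)
print(short)                                        # ['AACT', 'TTCA', 'CGAT']
print(4 * len(short) >= len(toy_code))              # True (pigeonhole)
print(min_distance(short) == min_distance(largest)) # True
```

With unit costs, removing a common first symbol from two strings does not
change their edit distance, which is why the minimum distance of the
shortened class equals that of the original class.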

We have the following relationship between the GC-content of a reverse
code and the code size over the alphabet $\{A, T, C, G\}$.

Proposition 13.

$$A_4^{GC,R}(n - 1, d, w) \le A_4^{GC,R}(n, d, w) \le A_4^{GC,R}(n, d - 1, w). \qquad (6)$$

$$A_4^{GC,R}(n - 1, d, w) \ge A_4^{GC,R}(n, d, w)/4. \qquad (7)$$

Proof 14. The analogous result for DNA codes with unrestricted GC-content
and Hamming distance was given in [?]. The corresponding proof is used
here for the edit distance.

(6): By the construction of codes over $\mathbb{Z}_4$, we obtain $4^n$
codewords of length $n$ and $4^{n-1}$ codewords of length $n - 1$, and the
result follows.

(7): The codewords of an $(n, A_4^{GC,R}(n, d, w), d)$ code over
$\mathbb{Z}_4$ can be partitioned into four subsets denoted
$C_1, C_2, C_3, C_4$ such that the size of subset $C_1$ is at least
$A_4^{GC,R}(n, d, w)/4$ and $C_1$ is an $(n, A_4^{GC,R}(n, d, w)/4, d)$
code. Removing a symbol from the codewords of $C_1$ such that the distance
$d$ and weight $w$ are maintained, we obtain an
$(n - 1, A_4^{GC,R}(n, d, w)/4, d)$ code, and the result follows.

Proposition 15. For $0 \le d \le n$ and $0 \le w \le n$,

$$A_4^{GC,RC}(n, d, w) = A_4^{GC,R}(n, d, w),$$

if $n$ is even, and

$$A_4^{GC,R}(n, d + 1, w) \le A_4^{GC,RC}(n, d, w) \le A_4^{GC,R}(n, d - 1, w),$$

if $n$ is odd.

Proof 16. The analogous result for DNA codes with unrestricted GC-content
and Hamming distance was given in [?]. The corresponding proof is
employed here for the edit distance. Given a set of codewords of length
$n$, if we replace all entries in any subset of the positions by their
complement, the GC-content of these codewords is preserved, as well as the
edit distance between any pair of codewords. The edit distance between a
codeword and the reverse or reverse-complement of the other codewords is
not in general preserved, but if $n$ is even and the first $n/2$
coordinates of each codeword $x_i$ are replaced by their complements to
form a new codeword $y_i$, then $d_c(x_i, x_j^r) = d_c(y_i, y_j^{rc})$ for
all codewords $x_i$ and $x_j$. Similarly, if $n$ is odd and the first
$(n - 1)/2$ coordinates of each codeword $x_i$ are replaced by their
complements to form $y_i$, then
$|d_c(x_i, x_j^r) - d_c(y_i, y_j^{rc})| \le 1$.